Abstract: The problem of optimising a network of discretely firing neurons
is addressed. An objective function is introduced which measures the average 
number of bits that are needed for the network to encode its state. When this is 
minimised, it is shown that this leads to a number of results, such as topographic 
mappings, piecewise linear dependence on the input of the probability of a 
neuron firing, and factorial encoder networks. 

1 Introduction 

In this paper the problem of optimising the firing characteristics of a network of 
discretely firing neurons will be considered. The approach adopted will not be 
based on any particular model of how real neurons operate, but will focus on
theoretically analysing some of the information processing capabilities of a layered
network of units (which happen to be called neurons). Ideal network behaviour 
is derived by choosing the ideal neural properties that minimise an information 
theoretic objective function which specifies the number of bits required by the 
network to encode the state of its layers. This is done in preference to assuming 
a highly specific neural behaviour at the outset, followed by optimisation of a 
few remaining parameters such as weight and bias values. 

Why use an objective function in the first place? An objective function is a 
very convenient starting point (a set of “axioms”, as it were), from which everything
else can, in principle, be derived (as “theorems”, as it were). An objective
function has the same status as a model, which may be falsified should some 
counterevidence be discovered. The objective function used in this paper is the 
simplest that is consistent with predicting a number of non-trivial results, such 
as topographic mappings, and factorial encoders (which are discussed in this 
paper). However, it does not include any temporal information, nor any biological
plausibility constraints (other than the fact that the network is assumed to
be layered). More complicated objective functions will be the subject of future 
publications. 

In section 2 an objective function is introduced, and its connection with
discretely firing neural networks is derived. In section 3 some examples are
presented which show how this theory of discretely firing neural networks leads 
to some non-trivial results. 

*This paper was submitted to Special Issue of Neurocomputing on Theoretical Analysis of 
Real-Valued Function Classes on 19 January 1998. It was not accepted for publication, but 
it underpins several subsequently published papers. 

2 Theory

In this section a theory of discretely firing neural networks is developed. Section
2.1 introduces the objective function for optimising an encoder, and section 2.2
shows how this can be applied to the problem of optimising a discretely firing
neural network.

2.1 Objective Function for Optimal Coding 

The inspiration for the approach that is used here is the minimum description 
length (MDL) method [5]. In this paper, a training set vector (which is unlabelled)
will be denoted as x, a vector of statistics which are stochastically
derived from x will be denoted as y, and their joint probability density function 
(PDF) will be denoted as Pr(x, y). The problem is to learn the functional form 
of Pr(x, y), so that vectors (x, y) sampled from Pr(x, y) can be encoded using 
the minimum number of bits on average. It is unconventional to consider the 
problem of encoding (x, y), rather than x alone, but it turns out that this leads 
to many useful results. 

Thus Pr(x, y) is approximated by a learnt model Q(x, y), in which case the 
average number of bits required to encode an (x, y) sampled from the PDF 
Pr(x, y) is given by the objective function D , which is defined as 

$$D = -\int dx \sum_{y} \Pr(x, y) \log Q(x, y) \tag{1}$$

Now split D into two contributions by using Pr(x, y) = Pr(x) Pr(y|x) and
Q(x, y) = Q(y) Q(x|y).

$$D = -\int dx \Pr(x) \sum_{y} \Pr(y|x) \log Q(x|y) - \sum_{y} \Pr(y) \log Q(y) \tag{2}$$

The first term is the cost (i.e. the average number of bits), averaged over all
possible values of y, of encoding an x sampled from Pr(x|y) using the model
Q(x|y). This interpretation uses the identity Pr(x) Pr(y|x) = Pr(y) Pr(x|y). The
second term is the cost of encoding a y sampled from Pr(y) using the model
Q(y). Together these two terms correspond to encoding y (the second term),
then encoding x given that y is known.
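
As an illustration of equations 1 and 2, the following minimal sketch evaluates the coding cost for a small discrete toy problem and checks that it splits into the two terms described above. It assumes, purely for illustration, that x is discrete (so the integral over x becomes a sum) and that Pr and Q are random tables; the function names are hypothetical.

```python
import numpy as np

def coding_cost(p_xy, q_xy):
    """Average cost D = -sum_{x,y} Pr(x,y) log2 Q(x,y), in bits (equation 1)."""
    return -np.sum(p_xy * np.log2(q_xy))

def split_cost(p_xy, q_xy):
    """Return the two terms of equation 2 for tables indexed as [x, y]."""
    p_y = p_xy.sum(axis=0)                 # Pr(y)
    q_y = q_xy.sum(axis=0)                 # Q(y)
    q_x_given_y = q_xy / q_y               # Q(x|y), one column per y
    first = -np.sum(p_xy * np.log2(q_x_given_y))   # cost of x given y
    second = -np.sum(p_y * np.log2(q_y))           # cost of y
    return first, second

# Toy assumption: 6 discrete input states, 3 code states, random tables.
rng = np.random.default_rng(0)
p_xy = rng.random((6, 3))
p_xy /= p_xy.sum()
q_xy = rng.random((6, 3))
q_xy /= q_xy.sum()

first, second = split_cost(p_xy, q_xy)
total = coding_cost(p_xy, q_xy)
```

In this sketch `total` equals `first + second` up to rounding, and the cost is never smaller than the cost obtained with the ideal model Q = Pr.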

The model Q(x, y) may be optimised so that it minimises D, and thus 
leads to the minimum cost of encoding (x, y) sampled from Pr(x, y). Ideally 
Q(x, y) = Pr(x, y), but in practice this is not possible because insufficient
information is available to determine Pr(x, y) exactly (i.e. the training set does
not contain an infinite number of (x, y) vectors). It is therefore necessary to
introduce a parametric model Q(x, y), and to choose the values of the parameters
so that D is minimised. If the number of parameters is small enough, and
the training set is large enough, then the parameter values can be accurately 
determined. 

A further simplification may be made if y can occupy far fewer states than
x (given y) can, because then the cost of encoding y is much less than the cost of
encoding x (given y) (i.e. the second and first terms in equation 2, respectively).
In this case, it is a good approximation to retain only the first term in equation
2. This approximation becomes exact if Q(y) assigns equal probability to all
states y, because then the second term is a constant. The reason for defining
the objective function D as in equation 1, rather than defining it to be the first
term of equation 2, is that equation 1 may be readily generalised to more
complex systems, such as (x, y, z) in which Pr(x, y, z) = Pr(x) Pr(y|x) Pr(z|y),
and so on. An example of this is given in section 3.1.

It is possible to relate the minimisation of D to the maximisation of the mutual
information I between x and y. If the cost of encoding an x sampled from
Pr(x) using the model Q(x) (i.e. $-\int dx \Pr(x) \log Q(x)$) and the cost of encoding
a y sampled from Pr(y) using the model Q(y) (i.e. $-\sum_{y} \Pr(y) \log Q(y)$) are
both subtracted from D, then the result is
$-\int dx \sum_{y} \Pr(x, y) \log \left( \frac{Q(x, y)}{Q(x)\, Q(y)} \right)$.
When Q(x, y) → Pr(x, y) this reduces to (minus) the mutual information I
between x and y. Thus, if the cost of encoding the correlations between x
and y is much greater than the cost of separately encoding x and y (i.e. the
log(Q(x) Q(y)) term can be ignored in I), then D-minimisation approximates
I-maximisation, which is another commonly used objective function.
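
The relation between D and the mutual information can be checked numerically in the ideal case Q = Pr. The sketch below again assumes a discrete x and a random joint table (both assumptions made only for illustration), and confirms that subtracting the two separate encoding costs from D leaves minus the mutual information.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy of a probability table, in bits."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information_bits(p_xy):
    """I(x; y) for a joint table indexed as [x, y]."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask]))

# Toy assumption: random joint distribution over 5 x-states and 4 y-states.
rng = np.random.default_rng(1)
p_xy = rng.random((5, 4))
p_xy /= p_xy.sum()

d = entropy_bits(p_xy.ravel())                  # D evaluated with Q = Pr
residual = d - entropy_bits(p_xy.sum(axis=1)) - entropy_bits(p_xy.sum(axis=0))
```

Here `residual` equals `-mutual_information_bits(p_xy)` up to rounding.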


2.2 Application to Neural Networks 


In order to apply the above coding theory results to a 2-layer discretely firing 
neural network, it is necessary to interpret x as a pattern of activity in the 
input layer, and y as the vector of locations in the output layer of a finite 
number of firing events. The objective function D is then the cost of using the 
model Q(x, y) of the network behaviour to encode the state (x, y) of the neural 
network (i.e. the input pattern and the location of the firing events), which 
is sampled from the Pr(x, y) that describes the true network behaviour. For 
instance, a second neural network can be used solely for computing the model 
Q(x, y), which is then used to encode the state (x, y) of the above first neural 
network. Note that no temporal information is included in this analysis, so the 
input and output of the network is a static (x, y) vector containing no time 
variables. 

These two neural networks can be combined into a single hybrid network, in 
which the machinery for computing the model Q(x, y) is interleaved with the 
neural network, whose true behaviour is described by Pr(x, y). The notation of
equation 2 can now be expressed in more neural terms, where Pr(y|x) is then
a recognition model (i.e. bottom-up) and Q(x|y) is then a generative model 
(i.e. top-down), both of which live inside the same neural network. This is 
an unsupervised neural network, because it is trained with examples of only 
x-vectors, and the network uses its Pr(y|x) to stochastically generate a y from 
each x. 

Now introduce a Gaussian parametric model Q(x|y)

$$Q(x|y) = \frac{1}{\left(\sqrt{2\pi}\,\sigma\right)^{\dim x}} \exp\left(-\frac{\|x - x'(y)\|^2}{2\sigma^2}\right) \tag{3}$$

where x'(y) is the centroid of the Gaussian (given y), and $\sigma$ is the standard deviation
of the Gaussian. Also define a soft vector quantiser (VQ) objective function $D_{VQ}$
as

$$D_{VQ} = 2 \int dx \Pr(x) \sum_{y} \Pr(y|x)\, \|x - x'(y)\|^2 \tag{4}$$
which is (twice) the average Euclidean reconstruction error that results when
x is probabilistically encoded as y and then deterministically reconstructed as
x'(y). These definitions of Q(x|y) and $D_{VQ}$ allow D to be written as

$$D = \frac{1}{4\sigma^2} D_{VQ} - \log\left(\frac{1}{\left(\sqrt{2\pi}\,\sigma\right)^{\dim x}}\right) - \sum_{y} \Pr(y) \log Q(y) \tag{5}$$

where the second term is constant, and the third term may be ignored if y can
occupy far fewer states than x (given y) can. The conditions under which
the third term can be ignored are satisfied in a neural network, because x is an
activity pattern, and y is the vector of locations of a finite number of firing
events.
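
The step from equations 3 and 4 to equation 5 can be made explicit. As a sketch, taking the logarithm of the Gaussian model and averaging over Pr(x) Pr(y|x) gives the first term of equation 2 as

```latex
\begin{align}
-\log Q(x|y) &= \frac{\|x - x'(y)\|^2}{2\sigma^2}
  + \dim x \, \log\left(\sqrt{2\pi}\,\sigma\right) \\
-\int dx \Pr(x) \sum_{y} \Pr(y|x) \log Q(x|y)
  &= \frac{1}{4\sigma^2} D_{VQ} + \dim x \, \log\left(\sqrt{2\pi}\,\sigma\right)
\end{align}
```

The factor $1/(4\sigma^2)$ rather than $1/(2\sigma^2)$ appears because $D_{VQ}$ is defined with a prefactor of 2, and the $\sigma$-dependent constant does not depend on Pr(y|x) or x'(y).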

The first term of D is proportional to $D_{VQ}$, whose properties may be investigated
using the techniques in [3]. Assume that there are n firing events,
so that $y = (y_1, y_2, \cdots, y_n)$, then the marginal probabilities of the symmetric
part S[Pr(y|x)] of Pr(y|x) under interchange of its $(y_1, y_2, \cdots, y_n)$ arguments
are given by

$$\Pr(y_1|x) = \sum_{y_2, \cdots, y_n = 1}^{M} S[\Pr(y_1, y_2, \cdots, y_n|x)]$$

$$\Pr(y_1, y_2|x) = \sum_{y_3, \cdots, y_n = 1}^{M} S[\Pr(y_1, y_2, \cdots, y_n|x)] \tag{6}$$

where $\Pr(y_1|x)$ may be interpreted as the probability that the next firing event
occurs on neuron $y_1$ (given x). Also define two useful integrals, $D_1$ and $D_2$, as

$$D_1 = \frac{2}{n} \int dx \Pr(x) \sum_{y=1}^{M} \Pr(y|x)\, \|x - x'(y)\|^2$$

$$D_2 = \frac{2(n-1)}{n} \int dx \Pr(x) \sum_{y_1, y_2 = 1}^{M} \Pr(y_1, y_2|x)\, (x - x'(y_1)) \cdot (x - x'(y_2)) \tag{7}$$

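where x'(y), as a function of a single firing location y, is any vector function
of y (i.e. it is not necessarily related to the centroid x'(y) of equation 3, whose
argument is the full vector of firing locations), to yield the following upper
bound on $D_{VQ}$

$$D_{VQ} \le D_1 + D_2 \tag{8}$$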


where $D_1$ is non-negative but $D_2$ can have either sign, and the inequality reduces
to an equality in the case n = 1. Thus far nothing specific has been assumed
about Pr(y|x), other than the fact that it contains no temporal information, so
the upper bound on $D_{VQ}$ applies whatever the form of Pr(y|x).

If the firing events occur independently of each other (given x), then
$\Pr(y_1, y_2|x) = \Pr(y_1|x) \Pr(y_2|x)$, which allows $D_2$ to be redefined as

$$D_2 = \frac{2(n-1)}{n} \int dx \Pr(x) \left\| x - \sum_{y=1}^{M} \Pr(y|x)\, x'(y) \right\|^2 \tag{9}$$

where $D_2$ is non-negative.
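
One way to see where the bound of equation 8 comes from is to decode the n firing events by averaging their single-event reconstruction vectors; for independent firing events, twice the resulting mean squared error is then exactly $D_1 + D_2$, and the optimal decoder can only do better. The sketch below checks this identity by exhaustive enumeration. It assumes a finite weighted sample in place of Pr(x), random Pr(y|x) and x'(y), and hypothetical function names.

```python
import itertools
import numpy as np

def d1_d2(x, w, p, c, n):
    """D_1 (equation 7) and D_2 (equation 9) for samples x with weights w.
    p[k, y] = Pr(y|x_k) and c[y] = x'(y)."""
    sq = ((x[:, None, :] - c[None, :, :]) ** 2).sum(-1)   # ||x_k - x'(y)||^2
    d1 = (2.0 / n) * np.sum(w * (p * sq).sum(1))
    mean_rec = p @ c                                      # sum_y Pr(y|x_k) x'(y)
    d2 = (2.0 * (n - 1) / n) * np.sum(w * ((x - mean_rec) ** 2).sum(1))
    return d1, d2

def d_vq_averaged(x, w, p, c, n):
    """Twice the mean squared error when n independent firing events
    (y_1, ..., y_n) are decoded as the average of their x'(y_i)."""
    m = c.shape[0]
    total = 0.0
    for ys in itertools.product(range(m), repeat=n):
        prob = np.prod(p[:, list(ys)], axis=1)    # Pr(y_1, ..., y_n | x_k)
        rec = c[list(ys)].mean(axis=0)            # averaged reconstruction
        total += np.sum(w * prob * ((x - rec) ** 2).sum(1))
    return 2.0 * total

# Toy assumption: 5 equally weighted 2-D samples, M = 3 neurons, n = 3 events.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))
w = np.full(5, 0.2)
p = rng.random((5, 3))
p /= p.sum(axis=1, keepdims=True)
c = rng.normal(size=(3, 2))
n = 3

d1, d2 = d1_d2(x, w, p, c, n)
d_vq = d_vq_averaged(x, w, p, c, n)
```

In this sketch `d_vq` agrees with `d1 + d2` to rounding error.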

In summary, the assumptions which have been made in order to obtain the
upper bound on $D_{VQ}$ in equation 8, with the definition of $D_1$ as given in equation
7 and $D_2$ as given in equation 9, are: no temporal information is included in the
network state vector (x, y), y can occupy far fewer states than x (given y) can,
and firing events occur independently of each other (given x). In reality, there
is always temporal information available, and the firing events are correlated
with each other, so a more realistic objective function could be constructed.
However, it is worthwhile to consider the consequences of equation 8 because
it turns out that it leads to many non-trivial results.

The upper bound on $D_{VQ}$ may be minimised with respect to all free parameters
in order to obtain a least upper bound. In the case of independent firing
events, the free parameters are the x'(y) and the Pr(y|x). These two types of 
parameters cannot be independently optimised, because they correspond to the 
generative and recognition models implicit in the neural network, respectively. 

A gradient descent algorithm for optimising the parameter values may readily
be obtained by differentiating $D_1$ and $D_2$ with respect to x'(y) and Pr(y|x).
Given the freedom to explore the entire space of functions Pr(y|x), the optimum
neural firing behaviour (given x) can in principle be determined, and in certain
simple cases this can be determined by inspection. If this option is not available,
such as would be the case if biological constraints restricted the allowed
functional form of Pr(y|x), then a limited search of the entire space of functions
Pr(y|x) can be made by invoking a parametric model of the neural firing
behaviour (given x).
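
A minimal sketch of such a gradient descent is given below, restricted to the x'(y) only. It assumes that Pr(y|x) is held fixed (here a softmax over distances to randomly placed anchor points, which is an illustrative choice and not a form taken from this paper), that Pr(x) is represented by a finite sample, and that firing events are independent so that $D_2$ has the form of equation 9.

```python
import numpy as np

def objective(x, p, c, n):
    """Sample estimate of D_1 + D_2 (equations 7 and 9)."""
    sq = ((x[:, None, :] - c[None, :, :]) ** 2).sum(-1)
    d1 = (2.0 / n) * np.mean((p * sq).sum(1))
    d2 = (2.0 * (n - 1) / n) * np.mean(((x - p @ c) ** 2).sum(1))
    return d1 + d2

def grad_centroids(x, p, c, n):
    """Gradient of the sample estimate of D_1 + D_2 with respect to x'(y)."""
    k = x.shape[0]
    # D_1 part: -(4/n) <Pr(y|x) (x - x'(y))>
    g1 = -(4.0 / n) * (p.T @ x - p.sum(0)[:, None] * c) / k
    # D_2 part: -(4(n-1)/n) <Pr(y|x) (x - sum_y' Pr(y'|x) x'(y'))>
    g2 = -(4.0 * (n - 1) / n) * (p.T @ (x - p @ c)) / k
    return g1 + g2

# Toy assumptions: 200 Gaussian samples in 2-D, M = 4 neurons, n = 5 events.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
anchors = rng.normal(size=(4, 2))
logits = -((x[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
p = np.exp(logits - logits.max(1, keepdims=True))
p /= p.sum(1, keepdims=True)            # fixed Pr(y|x), rows sum to 1
c = rng.normal(size=(4, 2))             # initial x'(y)
n = 5

history = [objective(x, p, c, n)]
for _ in range(200):
    c = c - 0.1 * grad_centroids(x, p, c, n)
    history.append(objective(x, p, c, n))
```

Because the objective is quadratic in the x'(y) when Pr(y|x) is fixed, a small constant step size is sufficient here; optimising Pr(y|x) as well would require a parametric form or the stationarity analysis of section 3.2.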

3 Examples 

In this section several examples are presented which illustrate the use of $D_1 + D_2$
in the optimisation of discretely firing neural networks. In section 3.1 a
topographic mapping network is derived from $D_1$ alone, in section 3.2 the Pr(y|x)
that minimises $D_1 + D_2$ is shown to be piecewise linear, and a solved example is
presented. Finally, in section 3.3 a more detailed worked example is presented,
which demonstrates how a factorial encoder emerges when $D_1 + D_2$ is minimised.

3.1 Topographic Mapping Neural Network 

When an appropriate form of $D_{VQ}$ is considered, it can be seen that it leads to a
network that is closely related to Kohonen’s topographic mapping network [1].

The derivation of a topographic mapping network that was given in [2] will
now be recast in the framework of section 2. Thus, consider the objective
function for a 3-layer network (x, y, z), in which (compare equation 1)



( 10 ) 


where the cost of encoding y has been ignored, so that effectively only a 2-layer
network (x, z) is visible, and $D_{VQ}$ is given by

$$D_{VQ} = 2 \int dx \Pr(x) \sum_{z=1}^{M_z} \Pr(z|x)\, \|x - x'(z)\|^2 \tag{11}$$

This expression for $D_{VQ}$ explicitly involves (x, z), but it may be manipulated
into a form that explicitly involves (x, y). In order to simplify this calculation,
$D_{VQ}$ will be replaced by the equivalent objective function

$$D_{VQ} = \int dx \Pr(x) \sum_{z=1}^{M_z} \Pr(z|x) \int dx' \Pr(x'|z)\, \|x - x'\|^2 \tag{12}$$


Now introduce dummy summations over y and y' to obtain

$$D_{VQ} = \int dx \Pr(x) \sum_{y=1}^{M_y} \Pr(y|x) \sum_{z=1}^{M_z} \Pr(z|y) \sum_{y'=1}^{M_y} \Pr(y'|z) \int dx' \Pr(x'|y')\, \|x - x'\|^2 \tag{13}$$


and rearrange to obtain

$$D_{VQ} = \int dx \Pr(x) \sum_{y'=1}^{M_y} \Pr(y'|x) \int dx' \Pr(x'|y')\, \|x - x'\|^2 \tag{14}$$

where

$$\Pr(y'|y) = \sum_{z=1}^{M_z} \Pr(y'|z) \Pr(z|y)$$

$$\Pr(y'|x) = \sum_{y=1}^{M_y} \Pr(y'|y) \Pr(y|x) \tag{15}$$

which may be replaced by the equivalent objective function

$$D_{VQ} = 2 \int dx \Pr(x) \sum_{y'=1}^{M_y} \Pr(y'|x)\, \|x - x'(y')\|^2 \tag{16}$$


By manipulating $D_{VQ}$ from the form it has in equation 11 to the form it has in
equation 16, it becomes clear that optimisation of the (x, z) network involves
optimisation of the (x, y') subnetwork, for which an objective function can be
written that uses a Pr(y'|x) as defined in equation 15. When optimising the
(x, y') subnetwork, Pr(y'|y) takes account of the effect that z has on y.

If n = 1, so that only 1 firing event is observed, then $D_{VQ} = D_1$, and the
optimum Pr(y|x) must ensure that y depends deterministically on x, so that
$\Pr(y|x) = \delta_{y, y(x)}$, where y(x) is an encoding function that converts x into the
index of the neuron that fires in response to x. This allows $D_{VQ}$ to be simplified
to

$$D_{VQ} = 2 \int dx \Pr(x) \sum_{y'=1}^{M_y} \Pr(y'|y(x))\, \|x - x'(y')\|^2 \tag{17}$$

where Pr(y'|y(x)) is Pr(y'|y) with y replaced by y(x). Note that if $\Pr(y'|y) = \delta_{y, y'}$
then $D_{VQ}$ reduces to the objective function $2 \int dx \Pr(x)\, \|x - x'(y(x))\|^2$
for a standard vector quantiser (VQ).

The optimum y(x) is given by $y(x) = \arg\min_{y} \sum_{y'=1}^{M_y} \Pr(y'|y)\, \|x - x'(y')\|^2$
(which is not quite the same as the $y(x) = \arg\min_{y} \|x - x'(y)\|^2$ used by Kohonen
in his topographic mapping neural network [1]), and a gradient descent
algorithm for updating x'(y') is $x'(y') \rightarrow x'(y') + \epsilon \Pr(y'|y(x))\, (x - x'(y'))$ (which is identical
to Kohonen’s prescription [1]). The Pr(y'|y) may thus be interpreted as
the neighbourhood function, and the x'(y') may be interpreted as the weight
vectors, of a topographic mapping. Because all states y that can give rise to
the same state z (as specified by Pr(z|y)) become neighbours (as specified by
Pr(y'|y) in equation 15), Pr(y'|y) includes a much larger class of neighbourhood
functions than has hitherto been used in topographic mapping neural networks.

Because of the principled way in which the topographic mapping objective 
function has been derived here, it is the preferred way to optimise topographic 
mapping networks. It also allows the objective function to be generalised to the 
case n > 1, where more than one firing event is observed. 
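
The encoding rule and the update rule of this section can be sketched in a few lines. The example below assumes a 1-dimensional chain of neurons, a row-normalised Gaussian neighbourhood matrix standing in for Pr(y'|y), and inputs drawn uniformly from [0, 1); these choices, and the function names, are illustrative assumptions rather than part of the derivation above.

```python
import numpy as np

def winner(x, centroids, nbr):
    """y(x) = argmin_y sum_y' Pr(y'|y) ||x - x'(y')||^2, with nbr[y, y'] = Pr(y'|y)."""
    sq = ((x[None, :] - centroids) ** 2).sum(axis=1)   # ||x - x'(y')||^2
    return int(np.argmin(nbr @ sq))

def update(x, centroids, nbr, eps):
    """x'(y') -> x'(y') + eps Pr(y'|y(x)) (x - x'(y')) for every y'."""
    y = winner(x, centroids, nbr)
    return centroids + eps * nbr[y][:, None] * (x[None, :] - centroids)

def gaussian_neighbourhood(m, width):
    """Assumed neighbourhood: Gaussian in index distance, rows normalised."""
    idx = np.arange(m)
    k = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / width) ** 2)
    return k / k.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
m = 10
nbr = gaussian_neighbourhood(m, width=1.5)
centroids = rng.random((m, 1))                 # initial weight vectors in [0, 1)
for _ in range(2000):
    centroids = update(rng.random(1), centroids, nbr, eps=0.1)
```

With the identity matrix as the neighbourhood function, `winner` reduces to the nearest-centroid rule of a standard vector quantiser, which matches the remark following equation 17.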


3.2 Piecewise Linear Probability of Firing 

The optimal Pr(y|x) has some interesting properties that can be obtained by
inspecting its stationarity condition. For instance, the Pr(y|x) that minimise
$D_1 + D_2$ will be shown to be piecewise linear functions of x.

Thus, functionally differentiate $D_1 + D_2$ with respect to log Pr(y|x), where
logarithmic differentiation implicitly imposes the constraint Pr(y|x) ≥ 0, and
use a Lagrange multiplier term $L = \int dx' \lambda(x') \sum_{y'=1}^{M} \Pr(y'|x')$ to impose the
normalisation constraint $\sum_{y=1}^{M} \Pr(y|x) = 1$ for each x, to obtain

$$\frac{\delta (D_1 + D_2 - L)}{\delta \log \Pr(y|x)} = \frac{2}{n} \Pr(x) \Pr(y|x)\, \|x - x'(y)\|^2 - \frac{4(n-1)}{n} \Pr(x) \Pr(y|x)\, x'(y) \cdot \left( x - \sum_{y'=1}^{M} \Pr(y'|x)\, x'(y') \right) - \lambda(x) \Pr(y|x) \tag{18}$$

The stationarity condition implies that $\sum_{y=1}^{M} \frac{\delta (D_1 + D_2 - L)}{\delta \log \Pr(y|x)} = 0$, which
may be used to determine the Lagrange multiplier function $\lambda(x)$. When $\lambda(x)$ is
substituted back into the stationarity condition itself, it yields

$$0 = \Pr(x) \Pr(y|x) \sum_{y'=1}^{M} \left( \Pr(y'|x) - \delta_{y, y'} \right) x'(y') \cdot \left( \frac{x'(y')}{2} - n x + (n-1) \sum_{y''=1}^{M} \Pr(y''|x)\, x'(y'') \right) \tag{19}$$


There are several classes of solution to this stationarity condition, corresponding
to one (or more) of the three factors in equation 19 being zero.

1. Pr(x) = 0 (the first factor is zero). If the input PDF is zero at x, then 
nothing can be deduced about Pr(y|x), because there is no training data 
to explore the network’s behaviour at this point. 

2. Pr(y|x) = 0 (the second factor is zero). This factor arises from the differentiation
with respect to log Pr(y|x), and it ensures that Pr(y|x) < 0
cannot be attained. The singularity in log Pr(y|x) when Pr(y|x) = 0 is
what causes this solution to emerge. 

3. $\sum_{y'=1}^{M} \left( \Pr(y'|x) - \delta_{y, y'} \right) x'(y') \cdot (\cdots) = 0$ (the third factor is zero). The
solution to this equation is a Pr(y|x) that has a piecewise linear dependence
on x. This result can be seen to be intuitively reasonable because
$D_1 + D_2$ is of the form $\int dx \Pr(x) f(x)$, where f(x) is a linear combination
of terms of the form $x^i \Pr(y|x)^j$ (for i = 0, 1, 2 and j = 0, 1, 2), which is a
quadratic form in x (ignoring the x-dependence of Pr(y|x)). However, the
terms that appear in this linear combination are such that a Pr(y|x) that
is a piecewise linear function of x guarantees that f(x) is a piecewise linear
combination of terms of the form $x^i$ (for i = 0, 1, 2), which is a quadratic
form in x (the normalisation constraint $\sum_{y=1}^{M} \Pr(y|x) = 1$ is used to remove
a contribution that is potentially quartic in x). Thus a piecewise
linear dependence of Pr(y|x) on x does not lead to any dependencies on x
that are not already explicitly present in $D_1 + D_2$. The stationarity condition
on Pr(y|x) (see equation 19) then imposes conditions on the allowed
piecewise linearities that Pr(y|x) can have.

For the purpose of doing analytic calculations, it is much easier to obtain analytic
results with the ideal piecewise linear Pr(y|x) than with some other functional
form. If the optimisation of Pr(y|x) is constrained, by introducing a
parametric form which has some biological plausibility, for instance, then analytic
optimum solutions are not in general possible to calculate, and it becomes
necessary to resort to numerical simulations. Piecewise linear Pr(y|x) should
therefore be regarded as a convenient theoretical laboratory for investigating 
the properties of idealised neural networks. 


3.2.1 Solved Example 

A simple example illustrates how the piecewise linearity property of Pr(y|x)
may be used to find optimal solutions. Thus consider a 1-dimensional input
coordinate $x \in [-\infty, +\infty]$, with $\Pr(x) = P_0$. Assume that the number of
neurons M tends to infinity in such a way that there is 1 neuron per unit length
of x, so that Pr(y|x) = p(y - x), where the piecewise linear property gives p(x)
as

$$p(x) = \begin{cases} 1 & |x| \le \frac{1}{2} - s \\ \frac{2s - 2|x| + 1}{4s} & \frac{1}{2} - s \le |x| \le \frac{1}{2} + s \\ 0 & |x| \ge \frac{1}{2} + s \end{cases} \tag{20}$$

and by symmetry x'(y) = y.

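This Pr(y|x) and x'(y) allow $D_1$ to be derived as

$$D_1 \text{ (per unit length)} = \frac{4 P_0}{n} \left( \int_{0}^{\frac{1}{2} - s} x^2\, dx + \int_{\frac{1}{2} - s}^{\frac{1}{2} + s} \frac{2s - 2x + 1}{4s}\, x^2\, dx \right) = \frac{P_0}{6n} \left( 4s^2 + 1 \right) \tag{21}$$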

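and $D_2$ to be derived as

$$D_2 \text{ (per unit length)} = \frac{2(n-1) P_0}{n} \left( \int_{-\frac{1}{2} + s}^{\frac{1}{2} - s} x^2\, dx + 2 \int_{\frac{1}{2} - s}^{\frac{1}{2}} \left( \frac{(2s - 1)(2x - 1)}{4s} \right)^2 dx \right) = \frac{(n-1) P_0}{6n} (2s - 1)^2 \tag{22}$$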

Because there is one neuron per unit length, the contribution per unit length to
$D_1 + D_2$ is the sum of the above two results

$$D_1 + D_2 \text{ (per unit length)} = \frac{P_0}{6n} \left( n (2s - 1)^2 + 4s \right) \tag{23}$$


If $D_1 + D_2$ is differentiated with respect to s, then the stationarity condition
$\frac{d(D_1 + D_2)}{ds} = 0$ yields the optimum value of s as

$$s = \frac{n - 1}{2n} \tag{24}$$

and the stationary value of $D_1 + D_2$ as

$$D_1 + D_2 \text{ (per unit length)} = \frac{(2n - 1) P_0}{6 n^2} \tag{25}$$
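
The closed-form results of this solved example can be cross-checked numerically. The sketch below assumes $P_0 = 1$, a trapezoidal firing probability p(x) that is 1 for |x| ≤ 1/2 - s and falls linearly to 0 at |x| = 1/2 + s, and x'(y) = y; it evaluates $D_1 + D_2$ per unit length by midpoint quadrature and compares the result with the closed form $\frac{P_0}{6n}(n(2s-1)^2 + 4s)$. The function names are hypothetical.

```python
import numpy as np

def d_total(s, n, p0=1.0):
    """Closed form for D_1 + D_2 per unit length."""
    return p0 / (6.0 * n) * (n * (2.0 * s - 1.0) ** 2 + 4.0 * s)

def p_fire(u, s):
    """Assumed trapezoidal p(u): 1 in the middle, linear edges of half-width s > 0."""
    return np.clip((s + 0.5 - np.abs(u)) / (2.0 * s), 0.0, 1.0)

def d_total_numeric(s, n, p0=1.0, m=200000):
    """D_1 + D_2 per unit length by midpoint quadrature (requires 0 < s <= 1/2)."""
    u = (np.arange(m) + 0.5) / m * 2.0 - 1.0          # midpoints on [-1, 1]
    d1 = (2.0 / n) * p0 * 2.0 * np.mean(p_fire(u, s) * u ** 2)
    x = (np.arange(m) + 0.5) / m - 0.5                # one unit cell [-1/2, 1/2]
    rec = sum(p_fire(y - x, s) * y for y in (-1.0, 0.0, 1.0))
    d2 = (2.0 * (n - 1) / n) * p0 * np.mean((x - rec) ** 2)
    return d1 + d2

n = 4
s_opt = (n - 1) / (2.0 * n)                           # stationary s
s_grid = np.linspace(0.01, 0.5, 491)
s_best = s_grid[np.argmin(d_total(s_grid, n))]        # grid search for comparison
```

For n = 4 the grid search lands on s = 3/8, and both `d_total(s_opt, n)` and `d_total_numeric(s_opt, n)` agree with $(2n - 1) P_0 / (6 n^2) = 7/96$ to within quadrature error.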

When n = 1 the stationary solution reduces to s = 0 and $D_1 + D_2$ (per unit
length) $= \frac{P_0}{6}$, which is a standard vector quantiser with non-overlapping neural
response regions which partition the input space into unit width quantisation
cells, so that for all x there is exactly one neuron that responds. Although
the neurons have been manually arranged in topographic order by imposing
Pr(y|x) = p(y - x), any permutation of the neuron indices in this stationary
solution will also be a stationary solution. This derivation could be generalised
to the type of 3-layer network that was considered in section 3.1, in which case
a neighbourhood function Pr(y'|y) would emerge automatically.

As n → ∞ the stationary solution behaves as $s \rightarrow \frac{1}{2}$ and $D_1 + D_2$ (per
unit length) $\rightarrow \frac{P_0}{3n}$, with overlapping linear neural response regions which cover
the input space, so that for all x there are exactly two neurons that respond
with equal and opposite linear dependence on x. As n → ∞ the ratio of the
number of firing events that occur on these two neurons is sufficient to determine
x to within a mean squared error of $O\left(\frac{1}{n}\right)$. When n = ∞ this stationary
solution is $s = \frac{1}{2}$ and $D_1 + D_2$ (per unit length) = 0. However, when n = ∞
there are infinitely many other ways
in which the neurons could be used to yield $D_1 + D_2$ (per unit length) = 0,
because only the $D_2$ term contributes, and it is 0 when $x = \sum_{y=1}^{M} \Pr(y|x)\, x'(y)$.

This is possible for any set of basis elements x'(y) that span the input space,
provided that the expansion coefficients Pr(y|x) satisfy Pr(y|x) ≥ 0. In this
1-dimensional example only two basis elements are required (i.e. M = 2), which
are x'(1) = -∞ and x'(2) = +∞. More generally, for this type of stationary
solution, M = dim x + 1 is required to span the input space in such a way that
Pr(y|x) ≥ 0, and if M < dim x + 1 then the stationary solution will span the
input subspace (of dimension M - 1) that has the largest variance.

The n = 1 and n → ∞ limiting cases are very different. When n = 1 the
optimum network splits up the input space into non-overlapping quantisation
cells, and as n → ∞ the optimum network does a linear decomposition of the
input space using non-negative expansion coefficients. This behaviour occurs
because for n > 1 the neurons can cooperate when encoding the input x, so that
by allowing more than one neuron to fire in response to x, the encoded version
of x is distributed over more than one neuron. In the above 1-dimensional
example, the code is spread over one or two neurons depending on the value of
x. This cooperation amongst neurons is a property of the coherent part $D_2$ of
the upper bound on $D_{VQ}$ (see equation 8).

3.3 Factorial Encoder Network 

For certain types of distribution of data in input space the optimal network 
consists of a number of subnetworks, each of which responds to only a subspace 
of the input space. This is called factorial encoding, where the encoded input is 
distributed over more than one neuron, and this distributed code typically has 
a much richer structure than was encountered in section 3.2.

The simplest problem that demonstrates factorial encoding will now be investigated
(this example was presented in [3], but the derivation given here is
more direct). Thus, assume that the data in input space uniformly populates
the surface of a 2-torus $S^1 \times S^1$. Each of the $S^1$ is a plane unit circle embedded
in $R^1 \times R^1$ and centred on the origin, and $S^1 \times S^1$ is the Cartesian product of
a pair of such circles. Overall, the 2-torus lives in a 4-dimensional input space
whose elements are denoted as $x = (x_1, x_2, x_3, x_4)$, where one of the circles lives
in $(x_1, x_2)$ and the other lives in $(x_3, x_4)$. These circles may be parameterised
by angular degrees of freedom $\theta_{12}$ and $\theta_{34}$, respectively.

The optimal Pr(y|x) (i.e. a piecewise linear stationary solution of the type
that was encountered in section 3.2) could be derived from this input data PDF
Pr(x). However, the properties of the sought-after optimal Pr(y|x) are preserved
if one restricts the solution space to the following types of Pr(y|x)

$$\Pr(y|x) = \begin{cases} \delta_{y, y(\theta_{12})} \text{ or } \delta_{y, y(\theta_{34})} & \text{type 1} \\ \delta_{y, y(\theta_{12}, \theta_{34})} & \text{type 2} \\ \frac{1}{2} \left( \delta_{y, y_{12}(\theta_{12})} + \delta_{y, y_{34}(\theta_{34})} \right) & \text{type 3} \end{cases} \tag{26}$$


where $y(\theta_{12})$ and $y_{12}(\theta_{12})$ encode $\theta_{12}$, $y(\theta_{34})$ and $y_{34}(\theta_{34})$ encode $\theta_{34}$, and
$y(\theta_{12}, \theta_{34})$ encodes $(\theta_{12}, \theta_{34})$. The allowed ranges of the code indices are
$1 \le y(\theta_{12}) \le M$ (and similarly $y(\theta_{34})$), $1 \le y_{12}(\theta_{12}) \le \frac{M}{2}$, $\frac{M}{2} + 1 \le y_{34}(\theta_{34}) \le M$,
and $1 \le y(\theta_{12}, \theta_{34}) \le M$. The type 1 solution assumes that all M neurons respond
only to $\theta_{12}$ (or, alternatively, all respond only to $\theta_{34}$), the type 2 solution
assumes that all M neurons respond to $(\theta_{12}, \theta_{34})$, and the type 3 solution (which
is a very simple type of factorial encoder) assumes that $\frac{M}{2}$ neurons respond only
to $\theta_{12}$, and the other $\frac{M}{2}$ neurons respond only to $\theta_{34}$.

In order to derive explicit results for the stationary value of $D_1 + D_2$, it is
necessary to optimise the x'(y). The stationarity condition on x'(y) may readily
be deduced from the stationarity condition $\frac{\partial (D_1 + D_2)}{\partial x'(y)} = 0$ as

$$n \int dx \Pr(x|y)\, x = x'(y) + (n-1) \int dx \Pr(x|y) \sum_{y'=1}^{M} \Pr(y'|x)\, x'(y') \tag{27}$$

If Pr(y|x) (and hence Pr(x|y)) are inserted into this stationarity condition, then
it may be solved for the corresponding x'(y).

Assuming that the encoding functions partition up the 2-torus symmetrically,
the three types of solution may be optimised as described in the following three
sections.


3.3.1 Type 1 Solution 

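Assume that $\Pr(y|x) = \delta_{y, y(x_1, x_2)}$, and that the y = 1 quantisation cell is the
Cartesian product of the arcs $|\theta_{12}| \le \frac{\pi}{M}$ and $0 \le \theta_{34} \le 2\pi$ of the 2 unit circles that
form the 2-torus, then the stationarity condition for x'(1) becomes

$$n \frac{M}{2\pi} \int_{-\frac{\pi}{M}}^{\frac{\pi}{M}} d\theta_{12}\, \frac{1}{2\pi} \int_{0}^{2\pi} d\theta_{34}\, (\cos\theta_{12}, \sin\theta_{12}, \cos\theta_{34}, \sin\theta_{34}) = x'(1) + (n-1) \frac{M}{2\pi} \int_{-\frac{\pi}{M}}^{\frac{\pi}{M}} d\theta_{12}\, \frac{1}{2\pi} \int_{0}^{2\pi} d\theta_{34}\, x'(1) \tag{28}$$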

which yields the solution $x'(1) = \left( \frac{M}{\pi} \sin\left(\frac{\pi}{M}\right), 0, 0, 0 \right)$. The first two components
are the centroid of the arc $|\theta| \le \frac{\pi}{M}$ of a unit circle centred on the origin. All
of the x'(y) can be obtained by rotating x'(1) about the origin by multiples of
$\frac{2\pi}{M}$. Using the assumed symmetry of the solution, the expression for $D_1 + D_2$
becomes

$$D_1 + D_2 = \frac{M}{\pi} \int_{-\frac{\pi}{M}}^{\frac{\pi}{M}} d\theta_{12} \left\| (\cos\theta_{12}, \sin\theta_{12}) - \left( \frac{M}{\pi} \sin\left(\frac{\pi}{M}\right), 0 \right) \right\|^2 + \frac{1}{\pi} \int_{0}^{2\pi} d\theta_{34}\, \|(\cos\theta_{34}, \sin\theta_{34}) - (0, 0)\|^2 \tag{29}$$

where the first (or second) term corresponds to the subspace to which the neurons
respond (or do not respond). This gives the stationary value of $D_1 + D_2$ as
$D_1 + D_2 = 4 - \frac{2M^2}{\pi^2} \sin^2\left(\frac{\pi}{M}\right)$. Only one neuron can fire (given x), because
$\Pr(y|x) = \delta_{y, y(\theta_{12})}$ or $\delta_{y, y(\theta_{34})}$, so no further information about x can be obtained
after the first firing event has occurred, and this result for $D_1 + D_2$ is therefore
independent of n, as expected.
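
The stationary value for the type 1 solution follows from equation 29 in a few lines. As a sketch, write $c = \frac{M}{\pi} \sin\left(\frac{\pi}{M}\right)$ for the first component of x'(1) (a shorthand introduced here only for this calculation), and use the fact that the average of $\cos\theta_{12}$ over the arc $|\theta_{12}| \le \frac{\pi}{M}$ is also c:

```latex
\begin{align}
\frac{M}{\pi} \int_{-\pi/M}^{\pi/M} d\theta_{12}
  \left\| (\cos\theta_{12}, \sin\theta_{12}) - (c, 0) \right\|^2
  &= 2 \left( 1 - 2c \cdot c + c^2 \right) = 2 \left( 1 - c^2 \right) \\
\frac{1}{\pi} \int_{0}^{2\pi} d\theta_{34}
  \left\| (\cos\theta_{34}, \sin\theta_{34}) \right\|^2 &= 2 \\
D_1 + D_2 &= 4 - 2c^2 = 4 - \frac{2M^2}{\pi^2} \sin^2\left(\frac{\pi}{M}\right)
\end{align}
```

The contribution of 2 from the second circle is the price paid for not encoding $\theta_{34}$ at all.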


3.3.2 Type 2 Solution 

Assume that the y = 1 quantisation cell is the Cartesian product of the arcs
$|\theta_{12}| \le \frac{\pi}{\sqrt{M}}$ and $|\theta_{34}| \le \frac{\pi}{\sqrt{M}}$ of the two unit circles that form the 2-torus. The
stationarity condition for x'(1) can be deduced from the type 1 case with the replacement
$M \rightarrow \sqrt{M}$, which gives $x'(1) = \left( \frac{\sqrt{M}}{\pi} \sin\left(\frac{\pi}{\sqrt{M}}\right), 0, \frac{\sqrt{M}}{\pi} \sin\left(\frac{\pi}{\sqrt{M}}\right), 0 \right)$.

The expression for $D_1 + D_2$ may similarly be deduced from the type 1 case as
twice the first term in equation 29 with the replacement $M \rightarrow \sqrt{M}$, to yield
the stationary value of $D_1 + D_2$ as $D_1 + D_2 = 4 - \frac{4M}{\pi^2} \sin^2\left(\frac{\pi}{\sqrt{M}}\right)$. As in the
type 1 case, this result for $D_1 + D_2$ is independent of n.

3.3.3 Type 3 Solution

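The stationarity condition for x'(1) can be written by analogy with the type
1 case, with the replacement $M \rightarrow \frac{M}{2}$, and modifying the last term to take
account of the more complicated form of Pr(y|x), to yield

$$n \frac{M}{4\pi} \int_{-\frac{2\pi}{M}}^{\frac{2\pi}{M}} d\theta_{12}\, \frac{1}{2\pi} \int_{0}^{2\pi} d\theta_{34}\, (\cos\theta_{12}, \sin\theta_{12}, \cos\theta_{34}, \sin\theta_{34}) = x'(1) + \frac{1}{2}(n-1) \frac{M}{4\pi} \int_{-\frac{2\pi}{M}}^{\frac{2\pi}{M}} d\theta_{12}\, \frac{1}{2\pi} \int_{0}^{2\pi} d\theta_{34}\, x'(1) \tag{30}$$

where $\frac{1}{2\pi} \int_{0}^{2\pi} d\theta_{34} \sum_{y'=\frac{M}{2}+1}^{M} \delta_{y', y_{34}(\theta_{34})}\, x'(y') = 0$ has been used (this follows
from the assumed symmetry of the solution). This yields the solution
$x'(1) = \frac{2n}{n+1} \left( \frac{M}{2\pi} \sin\left(\frac{2\pi}{M}\right), 0, 0, 0 \right)$. Using the assumed symmetry of the solution, the
expression for $D_1$ becomes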


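$$D_1 = \frac{2}{n} \left( \frac{M}{4\pi} \int_{-\frac{2\pi}{M}}^{\frac{2\pi}{M}} d\theta_{12} \left\| (\cos\theta_{12}, \sin\theta_{12}) - \frac{2n}{n+1} \left( \frac{M}{2\pi} \sin\left(\frac{2\pi}{M}\right), 0 \right) \right\|^2 + \frac{1}{2\pi} \int_{0}^{2\pi} d\theta_{34}\, \|(\cos\theta_{34}, \sin\theta_{34})\|^2 \right) \tag{31}$$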


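and the expression for $D_2$ becomes

$$D_2 = \frac{2(n-1)}{n} \frac{M}{2\pi} \int_{-\frac{2\pi}{M}}^{\frac{2\pi}{M}} d\theta_{12} \left\| (\cos\theta_{12}, \sin\theta_{12}) - \frac{n}{n+1} \left( \frac{M}{2\pi} \sin\left(\frac{2\pi}{M}\right), 0 \right) \right\|^2 \tag{32}$$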

This gives the stationary value of $D_1 + D_2$ as $D_1 + D_2 = 4 - \frac{n M^2}{(n+1) \pi^2} \sin^2\left(\frac{2\pi}{M}\right)$.

Because $\Pr(y|x) = \frac{1}{2} \left( \delta_{y, y_{12}(\theta_{12})} + \delta_{y, y_{34}(\theta_{34})} \right)$, one firing event has to occur in
each of the intervals $1 \le y \le \frac{M}{2}$ and $\frac{M}{2} + 1 \le y \le M$ for all of the information
to be collected about x. However, the random nature of the firing events means
that the probability with which this condition is satisfied increases with n, so
this result for $D_1 + D_2$ decreases as n increases.


3.3.4 Relative Stability of Solutions 

Collect the above results together for comparison. 



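$$D_1 + D_2 = \begin{cases} 4 - \frac{2M^2}{\pi^2} \sin^2\left(\frac{\pi}{M}\right) & \text{type 1} \\ 4 - \frac{4M}{\pi^2} \sin^2\left(\frac{\pi}{\sqrt{M}}\right) & \text{type 2} \\ 4 - \frac{n M^2}{(n+1) \pi^2} \sin^2\left(\frac{2\pi}{M}\right) & \text{type 3} \end{cases} \tag{33}$$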


For constant M and letting n → ∞, the value of $D_1 + D_2$ for the type 3
solution asymptotically behaves as $D_1 + D_2$ (type 3) $\rightarrow 4 - \frac{M^2}{\pi^2} \sin^2\left(\frac{2\pi}{M}\right)$, in
which case the relative stability of the three types of solution is: type 3 (most
stable), type 2 (intermediate), type 1 (least stable). Similarly, for constant n
and letting M → ∞, the relative stability of the three types of solution is:
type 2 (most stable), type 3 (intermediate), type 1 (least stable).

In both of these limiting cases the type 1 solution is least stable. If there 
is a fixed number of firing events n, and there is no upper limit on the number
of neurons M, then the type 2 solution is most stable, because it can partition 
the 2-torus into lots of small quantisation cells. If there is a fixed number of 
neurons M (which is the usual case), and there is no upper limit on the number 
of firing events n, then the type 3 solution is most stable, because the limited 
size of M renders the type 2 solution inefficient (the quantisation cells would be 
too large), so the 2-torus $S^1 \times S^1$ is split into two $S^1$ subspaces each of which
is assigned a subset of neurons. If n is large enough, then each of these two 
subsets of neurons has a high probability of occurrence of a firing event, which 
ensures that both of the $S^1$ subspaces are encoded.
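
The two limiting orderings can be reproduced by evaluating the three stationary values of $D_1 + D_2$ directly. The sketch below uses the expressions $4 - \frac{2M^2}{\pi^2}\sin^2\frac{\pi}{M}$ (type 1), $4 - \frac{4M}{\pi^2}\sin^2\frac{\pi}{\sqrt{M}}$ (type 2) and $4 - \frac{nM^2}{(n+1)\pi^2}\sin^2\frac{2\pi}{M}$ (type 3); the function names and the particular values of M and n are illustrative assumptions.

```python
import math

def d_type1(m):
    """All M neurons encode one circle only."""
    return 4.0 - 2.0 * m ** 2 / math.pi ** 2 * math.sin(math.pi / m) ** 2

def d_type2(m):
    """All M neurons jointly encode both circles."""
    r = math.sqrt(m)
    return 4.0 - 4.0 * m / math.pi ** 2 * math.sin(math.pi / r) ** 2

def d_type3(m, n):
    """Factorial encoder: M/2 neurons per circle, n firing events."""
    return 4.0 - n * m ** 2 / ((n + 1) * math.pi ** 2) * math.sin(2.0 * math.pi / m) ** 2

def ranking(m, n):
    """Solution types ordered from most stable (smallest D_1 + D_2) to least stable."""
    values = {"type 1": d_type1(m), "type 2": d_type2(m), "type 3": d_type3(m, n)}
    return sorted(values, key=values.get)

few_neurons_many_events = ranking(16, 1000)
many_neurons_few_events = ranking(10000, 2)
```

With M = 16 and n = 1000 the ordering is type 3, type 2, type 1, and with M = 10000 and n = 2 it is type 2, type 3, type 1, in line with the two limiting cases described above.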

More generally, when there is a limited number of neurons they will tend to 
split into subsets, each of which encodes a separate subspace of the input. The 
assumed form of Pr(y|x) in equation 26 does not allow an unrestricted search
of all possible Pr(y|x). If the global optimum solution (which has piecewise
linear Pr(y|x), as shown in section 3.2) cuts up the input space into partially
overlapping pieces, then it is well approximated by a solution such as one of
those listed in equation 26. Typically, curved input spaces lead to such solutions,
because a piecewise linear Pr(y|x) can readily quantise such spaces by slicing
off the curved “corners” that occur in such spaces.


4 Conclusions 

In this paper an objective function for optimising a layered network of discretely 
firing neurons has been presented, and three non-trivial examples of how it is 
applied have been shown: topographic mapping networks, piecewise linear dependence
on the input of the probability of a neuron firing, and factorial encoder
networks. Many other examples could be given, such as combining the first and 
third of the above results to obtain factorial topographic networks, or extending 
the theory to multilayer networks, or introducing temporal information. 